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ABSTRACT 



A graphical test bed in which the results of a simula- 
tion experiment can be reported and analyzed is described. 

The test bed is based on the regression adjusted graphics 
and estimation methodology developed by Heidelberger and 
Lewis [Ref. 1] for regenerative simulation. From the 
graphics and associated numerics, the experimenter can sum- 
marize and see simultaneously relative properties, such as 
bias, normality and standard deviation, of several estimators 
of a characteristic of a population for up to 8 sample sizes. 
The evolution of these properties with sample size is also 
displayed. The graphics is supported on a line printer to 
make it and the program portable. The technique is illus- 
trated by examples concerning the effects of changes in 
data distribution on the behavior of the lag one serial 
correlation coefficient, the estimation of the shape 
parameter of Gamma random variables and a comparison of 
different methods (jackknife, bootstrap) for estimating the 
standard error of an estimator. 
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I. SYNOPSIS 



SIMTBED 



THE PROGRAM: 

Portable FORTRAN program using printer plot graphics 
(3 different program versions) 

Program will run on: 

IBM 

VAX 

IBM PC 
etc . 

ca. 900 lines of FORTRAN Code 
Memory requirements: 

SIMTBl 1 M Bytes 

SIMTB2 1 M Bytes 

SIMTB3 0.5 M Bytes 

(may slightly differ with different type of estimator 
functions and subsample sizes) 

PURPOSES: 

To explore the distribution of a statistical estimator 
To see how that distribution changes with sample size 
To compare that distribution with the distribution of 
competing estimators 
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THE USER SUPPLIES: 



A. Optional Parameters; 

NE ( 1) , NE ( 2 ) , . . . , NE ( 8) = Subsample Sizes (maximum is 8) 

The estimator is computed based on NE ( i) data points 
N = Total Number of simulated data points per 

replication 

At Subsample Size NE ( i ) , there are [N/NE(i)J 
independent values of the estimator 
M = Number of Replications 

When all replications have been run, there are 
M*jN/NE(i)J independent values of the estimator 
at each NE ( i) 

D = Degree of Regression (maximum D = 3) 

L = Number of Subsample Sizes (maximum L = 8) 

GRAPHICS and SCALING options 

B. Data: 

A total of M*N simulated data values are needed. In 
SIMTBl and SIMTB2 the same data is used at each 
subsample size (NE(i)) . In SIMTB3 new data is always 
generated. 

C. Estimators: 

Up to 3 FORTRAN functions (i.e. Estimators) are needed. 
They must accept as inputs a data subsample and the size 
of that subsample. They must return one value of the 
estimator for that subsample. 
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SIMTBED PRODUCES : 



A one page graphical output (box plots) at each 
subsample size 

Numerical Summaries at each subsample size (mean, 

Std. Dev., Std. Dev. of the mean, skewness, kurtosis) 
Regressions to quantify changes in the mean and 
variance as subsample size changes 
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II. INTRODUCTION 



SIMTBED, with the different versions (SIMTBl, SIMTB2 
and SIMTB3) is a graphical display program. The program 
is based on the program RAGE [Ref. 2]. It is used, with 
simulated data, on a digital computer. The program can be 
used to examine statistical estimators of different type, 
or properties of a single estimator under different distri- 
butional assumptions . The distribution of the estimator can 
be explored for given sample sizes and the properties can be 
compared for different sample sizes. The estimation condi- 
tions are controlled by the experimenter. It is also possible 
to examine the effects of changes in the underlying distri- 
bution of the data. 

When the program SIMTBl or SIMTB2 is used with simulated 
data, the data is assumed to be independent and identically 
distributed (iid) . This iid data can be sectioned into 
M independent blocks of specified sample size N. The sample 
of size N, will be sectioned into smaller subsample of size 
NE(k). The estimates are then calculated from this subsample 
of size NE (k) . 

One salient feature of the program versions SIMTBl and 
SIMTB2 is that they use the same batch of simulated random 
variables to explore the properties of all the estimators 
at the various subsample sizes. This is done for economy 
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and could be important on slow computers; the price paid is 
that the regression analysis provided by SIMTB1 and SIMTB2 
of its graphical output is performed on correlated samples. 

The version SIMTB3 uses one dimensional data and does 
not have this repetition feature. New data is used for each 
calculation of each estimator at each subsample size. The 
data is generated when the estimator function is called and 
only the needed batch of the exact subsample size is gener- 
ated. This technique reduces the memory requirements. 

Moreover all data sets are uncorrelated and a much more 
precise correlation can be performed if required. But this 
technique increases the computer time. 

To use the program it is necessary only to define the 
optional parameters (see Section IV) , supply the simulated 
random variables or a batch of data points, and provide the 
FORTRAN functions for the calculation of the estimators 
which are of interest. The program itself will subdivide 
the input data and feed the data properly into the functions, 
scale the graphic display, produce boxplots and summary 
statistics. A regression will be computed for the mean and 
variance of each estimator based on inverse subsample size. 
The result of the regression is displayed graphically and 
numerically . 

The program is written in ANSI Standard FORTRAN (X3.9- 
1966) and extensively tested on an IBM 3033 computer using 
FORTRAN IV (H Extended) or FORTRAN IV (Gl) compilers. The 
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program SIMTBED (all versions) provides all subroutines 
used inside the program. Besides the estimator functions, 
the user has to provide NO additional subroutines via 
packages like IMSL. The program should be totally portable; 
it has been tested under FORTRAN 77 on a VAX 11/780. The 
only limitation is the available memory. 



16 



III. GENERAL IDEA AND DATA STRUCTURE OF THE PROGRAM 



The main purpose of SIMTBED, with the different versions, 
is to explore the distributional behavior of estimators 
and show their properties in a graphical and numerical 
display. All versions use the same ideas, they only differ 
in the type of data and the way the data is used inside the 
program. 








Figure 1. Sectioning of the Data into M*N Sections 



17 



To study the behavior of an estimator the experimenter 
usually uses well known simulated data. Thus if one is 
interested in exploring the behavior of estimates of the 
shape parameter in a Gamma population, one generates Gamma 
variates from a random number generator package (e.g., IMSL 
subroutine Chap. G) . This batch of simulated data is 
processed by SIMTBED in the following way. 

The data batch is first divided into M independent 
blocks. Each block contains N data points. So the starting 
data batch has to consist of M*N data points (see Figure 1) . 

All blocks are divided into subsamples of size n^ . The 

actual subsample size n^ is an element of the subsample 

size array NE . This array can store up to 8 different 

values. Then the estimator is calculated for each subsample 

of size n.. The estimator function will be calculated 
1 

([N/n^])*M times. This total population of estimates is used 
to evaluate the summary statistics for the estimate and 
construct the corresponding box plot. If NE contains 
another element, the blocks are divided into the new sub- 
sample n^ + ^ size and all calculations are done again. 

In addition to the summary statistics and the box 
plots (see e.g., Figure 3a), a regression on the averages 
and on the variance is computed, following the methodology 
of regression adjusted estimate (RARE) developed by 
Heidelberger and Lewis [Ref. 1] . 
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The RARE estimate is the regression coefficient cig . 

It is the asymptotic estimate of the expected value of the 
parameter. The unbiased RAGE estimate of the average of the 
parameter is determined by the regression formula: 

E ( 0 ( n j_ ) ) = a 0 + a i ~ + a 2 — I + * ’ ’ + a D “D 

1 n i n i 

where D is an input parameter. 

The asymptotic RARE estimate is printed as a dashed line 
in the graph. 

The unbiased RAGE estimate of the variance of the esti- 
mator is given by the formula: 



S 2 (n ± ) 



6 1 ~ + 
1 



(n.) 



1.5 



+ 6 . 



(n.) 



The regression calculations of the average are done for 
each block separately. The result are M sets of regression 
coefficients. The finally printed regression coefficients 
are the averages of the M replications (see Figure 2) . From 
the set of coefficients the variance and the standard 
deviation of the regression coefficients is calculated. 

The regression on the variance is done once, using all 
(M*N) data points. 
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Figure 2. Structure of the Regression Coefficients 
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IV. ARGUMENTS OF THE PROGRAMS SIMTBl AND SIMTB2 



All versions of SIMTBED have in general the same argu- 
ment list. SIMTB2 (a version for multivariate random varia- 
bles) has two additional arguments. All arguments will be 
described in detail and all restrictions or limitations will 
be discussed. In Section V, detailed examples of setups 
for SIMTBl and SIMTB2 are given. 

A. X — DATA ARRAY 

The array X is the input data array. X is single pre- 
cision REALM, and is generally simulated data e.g., in 
Section VII, independent Gamma variates. 

In SIMTBl the dimension of the array X must not exceed 
50,000. 

In SIMTB2 the input data is multivariate and conse- 
quently the array X has two dimensions. The size of the 
dimensions is not directly limited by the program. The 
space must be provided by the calling program and is passed 
as an argument (see IR and IRK) . The memory requirements 
increase rapidly as the dimensions of X increase. 

B. N — SAMPLE SIZE 

The sample size N is the number of data points per section 
of input data X. N is an INTEGER. Depending on the pre- 
cision of the simulation that is required, the sample size 
N can vary from 1 to 50,000. 
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In SIMTB2 N is the number of multivariate data points per 
section. M (the number of replications) times N must not 
exceed IR, the row dimension of X. 

If M times N exceeds 50,000 an error message will be 
printed and the execution will be terminated. If the product 
M times N exceeds the total number of data points provided 
by the user, NO error message will signal the user error and 
the result of the execution is not predictable. 

C. M — NUMBER OF REPLICATIONS 

The number of replications M of the array X is an 
INTEGER. M determines the number of sections, into which 
the data set X is divided. So M*N is the dimension of X in 
SIMTBl. The parameter M also determines the number of 
regressions on the mean values that will be run to find the 
regression coefficients in the regression on the mean. If 
M is 1 only one regression will be done and no variance and 
standard deviation of the regression coefficients can be 
calculated. 

D. NE — SUBSAMPLE SIZE ARRAY 

The argument NE is an INTEGER array, containing up to 
8 subsample sizes. These are the subsample sizes at which 
the properties of the estimator are to be investigated. NE 
must always contain 8 elements. If less than 8 subsample 
sizes are used, the array must be filled up with dummy 
values. The estimator is calculated for each subsample 
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individually. The used values of NE (not the dummy values) 
have to be ordered from the smallest to the largest 
(NE (1) < NE (2) < NE (3) < . . . ) . 

The program can only handle up to a total number of 
12,500 estimates. This limit is caused by storage limita- 
tions and can be expanded for larger computers. The smallest 
subsample size NE(1) produces the most estimates and if this 
case exceeds 12,500 the execution will be terminated and an 
error message will be printed. 

In SIMTB2 all values of NE have to be less than 5,000. 

E. L — NUMBER OF SUBSAMPLE SIZES (BOX PLOTS) 

The number of box plots L in the graphical output (the 
number of subsamples at which the estimator is examined) , 
is an INTEGER with values from 1 to 8 . The value of L 
determines how many elements of the array NE the program 
will use (e.g., L = 2, the program uses only NE(1) and NE(2) 
for the calculations) . L determines the number of box plots 
and the number of corresponding summary statistics in the 
output. If L is out of range the program will terminate 
execution and print an error message. 

For L = 1 no regression is possible. No regression 
output will be printed. 

F. D — DEGREE OF REGRESSION 

The degree of regression on the mean D, is an INTEGER 
with values from 1 to 3. The chosen degree refers to the 
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number of coefficients calculated and printed for the 
regression equations. Experience has shown that D = 3 is 
preferable (higher values cause severe numerical problems in 
the regression computations). 

If D exceeds the value of L-l, the program reduces D 
to this number, regardless of the value chosen by the user. 

G. RG— REDUCED GRAPHICS 

The argument RG is an INTEGER with value; 

0 = off ==> full graphics 

1 = set -=> reduced graphics 

In the reduced graphics case the vertical scale of the 
graph is reduced by eliminating the outliers. Only the 
number of outliers is counted and the number is printed. 

This enables the user to see the body of the boxplot in more 
detail . 

H. SEI — SCALING ESTIMATORS INDIVIDUALLY 

The argument SEI is an INTEGER with value: 

0 = off ==> common scaling 

1 = set ==> individual scaling. 

For SEI = 0 all printed graphs (max. 3 per program 
run) are scaled to the same range. This makes the compari- 
son of the different graphs easier. The value SEI = 1 causes 
the program to scale each graph individually. 

The argument SEI works in combination with the arguments 
RG and SVS . The combination of these three arguments make it 
possible to fit the printed graphs to the needs of the user. 
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I. SVS — SETTING THE VERTICAL SCALE 

The argument SVS is an INTEGER with value: 

0 = off ==> automatic scaling 

1 = set ==> scaling by the user. 

SVS = 0 causes the program to calculate the scale of 
the graphic printout. The final graphic display is influ- 
enced by the chosen values of RG and SEI . The values of YMIN 
and YMAX are ignored by the program. 

For SVS = 1 the user must provide the scaling. The 
graphics are vertically scaled to a given minimum value 
(YMIN) and maximum value (YMIN and YMAX) . The arguments of 
RG and SEI are ignored. Only the numbers of outliers outside 
the display are printed for each box plot. 

J. YMIN — MINIMUM VALUE OF THE VERTICAL SCALE 

The scaling parameter YMIN is data type REAL*4. It is 
the lower limit of the vertical scale. It affects the 
scaling only in combination with SVS =1. If the chosen 
value is so large that the value YMIN lies inside the body 
of the box plot, an error occurs and the program ends with 
an abnormal ending. 

K. YMAX— MAXIMUM VALUE OF THE VERTICAL SCALE 

The scaling parameter YMAX is data type REAL*4. It is 
the upper limit of the vertical scale. The scaling is only 
effected if SVS has value 1. If the value of YMAX is too 
small and it lies inside the body of the box plot, the 
program comes to an abnormal ending. 
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YMIN and YMAX in combination with SVS = 1 should only 
be used for well known graphic ouptut. With this option, 
it is possible to scale the output of different program 
runs to a common scale. In particular if more than 3 esti- 
mators have to be estimated and compared, so that a common 
scale is needed, this option may be used. 

L. NEST— NUMBER OF ESTIMATORS 

The parameter NEST is an INTEGER with the value 1, 2 
or 3. The value of NEST determines the number of different 
one-page graphic displays the program produces with one call 
from the calling program. Usually the value of NEST is 
equal to the number of different estimators used in the 
program. 

In SIMTB2 the same estimator may be used with different 
(e.g., normal, exponential etc.) distributed data sets. 

M. ESTl , EST2, EST3 — ESTIMATOR FUNCTIONS 

These 3 parameters are data type REAL *4 and are used 
to pass the EXTERNAL function names to the program. An 
external declared function is a function which computes the 
value of an estimator. This function may call other routines, 
but the final value of the estimator must be passed over 
this function name. 

For SIMTBl each function must have the two arguments X 
and NEK (e.g., FUNCTION VARIANCE (X ,NEK) ) . X is the data 
array and NEK is the number of data elements in X. 
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SIMTB2 needs for each function four arguments X, NEK, 

IDR, IRK (e.g., FUNCTION CORRELATION (X, NEK, IDR, IRK) ) . X 
is a two dimensional data array and NEK is the subsample 
size, for which the function is evaluated. IDR and IRK are 
the dimensions of the array X. 

If the user wants to use less than 3 estimator functions, 
he must use dummy arguments and choose the correct value of 
NEST. The easiest way to do this is to use a function of 
a previous used estimator again (for details of the pro- 
gramming see Section V) . 

N. TTLl , TTL2, TTL2 — DESCRIPTION OF THE ESTIMATORS 

These arrays are used to pass titles from the calling 

program to SIMTB. Each title has to be 120 characters long 
(15 fields, each 8 characters wide). The title is printed 
below the output of the corresponding function. 

If the user uses less than 3 functions and titles, he 
has to use dummy arguments (for details see Section V) . 

If the titles are not in the correct format, correspond- 
ing to the FORMAT statement the program will not execute 
properly . 

O. IR, IRK — DIMENSIONS OF THE ARRAY X 

These arguments are used ONLY in SIMTB2 . IR Crow dimen- 
sion) and IRK (column dimension) are INTEGERS and have to 
match the dimension of X in the calling program. 

If the dimensions don't match, NO error message will be 
produced and the output is unpredictable. The error may 
not be obvious , and the ouput may look reasonable . 
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V. DUMMY EXAMPLES OF IMPLEMENTING SIMTB AND SIMTB2 

INTO A DRIVER-PROGRAM 



The following examples show how the programs SIMTB1 or 
SIMTB2 can be implemented into a FORTRAN driver-program. 

The only purpose of the examples is to clarify the FORTRAN 
implementation, to avoid programming errors by the user. 
Additional comment lines are added to the program examples . 

In the first example SIMTBl is used to compare two 
different estimators (VAR and UNBVAR) of the variance of a 
normal sample. The sample may be generated with a random 
number generator (e.g., LLRANDOMII) . 

In the second example SIMTB2 is used to compare the 
distribution of two estimators. The estimators are the 
covariance (COV) and the correlation coefficient (CORR) of 
a bivariate standard normal sample. 

A. EXAMPLE 1 USING SIMTBl 
MAIN 

C EXAMPLE of SIMTBl Calling program, it has not to be 
the MAIN program 

REAL* 4 X ( 50000) , YMIN, YMAX, VAR, UNBVAR 
REAL *8 Tl ( 15) , T2(15), T3(15) 

INTEGER N, M, NE(8), L, D, RG, SEI , SVS , NEST 
C 

DATA N /2500/ 
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DATA M / 20/ 

DATA NE 710,20,25,50,100,0,0,0/ 

C Array NE must have 8 elements, if only 5 are used, add 
C dummy variables and set L = 5 
DATA L / 5/ 

DATA RG / 0/ 

DATA SEI / 0/ 

DATA SVS / 0/ 

DATA NEST/ 2/ 

DATA Tl / 'ESTIMATE ', 'OF THE V ' , ' ARIANCE ' , ' US ING ', 
+ ' VAR= ( 1/N) ' , ' *SUM( X ( I ' , ' ) -XBAR) * ' , ' *2 ',7*' '/ 

DATA T2 /'ESTIMATE' , 'OF THE V', 'ARIANCE ', 'USING ', 
+ ' VAR= (1/(1', ' -N) ) *SUM ' , ' (X ( I) -XB ' , ' AR) **2 ',7*' '/ 

C All 15 fields (each 8 characters) have to be 
C initialized 
C 

EXTERNAL VAR, UNBVAR 
C 

C Generate M*N independent Normal (.0,1) Random numbers 
C and store into X 
C 

CALL SIMTB1 (X,N, M,NE,L,D,RG, SEI, SVS, YMIN, YMAX, NEST 
, VAR, Tl, UNBVAR, T2 ,VAR,T1) 

C EST3= VAR and TTL3=T1 used as dummy variables 
C 

STOP 

END 



29 



c 

FUNCTION VAR (X, NEK) 

C Function to calculate the Variance. 

C All calculations inside the function should be done in 
C DOUBLE PRECISION 
C 

REAL *4 X(N) , VAR 
REAL* 8 SUM, XBAR , DVAR 
C 

SUM=0 .0D0 
DO 10 1=1, N 

S UM= S UM+ DB LE (X ( I ) ) 

10 CONTINUE 

XB AR=S UM/FLOAT ( N ) 

SUM=0 .0D0 
DO 20 1=1, N 

SUM=SUM+( (DBLE (X ( I) ) ) -XBAR) **2 
20 CONTINUE 

DVAR=S UM/FLOAT (N) 

VAR=SNGL ( DVAR) 

C 

RETURN 

END 

FUNCTION UN B VAR (X ,NEK) 

C Function to calculate the Variance. 

C All calculations inside the function should be done in 



30 



C DOUBLE PRECISION 



C 

REAL *4 X (N) , UN B VAR 
REAL *8 SUM, XBAR, DUNVAR 
C 

SUM=0 . 0D0 
DO 10 1=1, N 

S UM= S UM+ DB LE (X (I) ) 

10 CONTINUE 

XBAR= SUM/FLOAT (N) 

SUM=0 .0D0 
DO 20 1=1, N 

SUM=SUM+ ( ( DBLE ( X ( I ) ) ) -XBAR) * * 2 
20 CONTINUE 

DUNVAR= S UM/FLOAT ( N ) 

UNBVAR=SNGL ( DUNVAR) 

C 

RETURN 

END 

B. EXAMPLE 2 USING SIMTB2 
MAIN 

C EXAMPLE of SIMTB2 Calling program, it has not to be 
the MAIN program 

REAL* 4 X ( 25000 , 2) , YMIN , YMAX , COV, CORR 
REAL *8 Tl (.15) , T2(15), T3(15) 
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INTEGER N, M, NE(8), L, D, RG, SEI, SVS, NEST, IR, 

+ IRK 

C 

DATA N /250 0/ 

DATA M / 10/ 

DATA IR /25000/ 

DATA IRK / 2/ 

C IR and IRK must be equal to the dimensions of X 
DATA NE /10, 20,25,50, 83, 100,125,250/ 

DATA L / 8/ 

DATA RG / 0/ 

DATA SEI / 0/ 

DATA SVS / 1/ 

DATA YMIN/ 0.0/ 

DATA YMAX/ 1.0/ 

DATA NEST/ 2/ 

DATA Tl /'ESTIMATE' , 'OF THE C ' , ' OVARIANC ' , ' E 
+ 11 *' '/ 

DATA T2 /'ESTIMATE ', 'OF THE C ' , ' ORRELAT 1 ' , ' ON COEFF ' , 
+ ’ ICIENT ’,’10*’ '/ 

C All 15 fields (each 8 characters) have to be 
C initialized 
C 

EXTERNAL COV, CORR 
C 

C Generate M*N pairs of independent random bivariate 
C numbers, each pair being independent N(0,1) random 
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C variables and store into X 



C 

CALL SIMTB2 ( X , N ,M , NE , L , D , RG , SEI , SVS , YMIN , YMAX NEXT 
+ , COV, Tl , CORK, T 2 , COV, Tl , IR , IRK) 

C EST3=COV and TTL3=Tl used as dummy variables 
C 

STOP 

END 

C 

C 

FUNCTION COV (X, NEK, IDR, IRK) 

C Function to calculate the Covariance. 

C All calculations inside the function should be done in 
C DOUBLE PRECISION 
C 

REAL*4 X( IDR, IRK), COV 

REAL* 8 SUMl , SUM2 , SUM3 , XBARl , XBAR2 , EX1X2 , DCOV 
C 

SUMl=0 . ODD 
SUM2=0 . 0D0 
SUM3=0 . 0D0 
DO 10 1=1, N 

SUMl=SUMl+DBLE (X ( I , 1) ) 

SUM2=SUM2+DBLE (X ( I , 2 ) ) 

SUM3=SUM3+DBLE (X ( I , 1) *X(I,2) ) 

10 CONTINUE 

XBARl=SUMl/FLOAT (N) 
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XB AR2 = S UM 2 / FLOAT (N) 
EXlX2=SUM 3/FLOAT (N) 
DC0V=EX1X2- ( XB AR 1 *XB AR2 ) 
COV=SNGL ( DCOV) 

RETURN 

END 



C 

FUNCTION CORR ( X, NEK, IDR, IRK) 

C Function to calculate the Correlation coefficient 
C All calculations inside the function should be done in 
C DOUBLE PRECISION 
C 

REAL *4 X( IDR, IRK), CORR 

REAL* 8 SUMl , SUM2 , SUM3 , XBARl , XBAR2 , EX1X2 , VARl , VAR2 , 

+ COV, DCORR 

C 

SUMl=0 . 0D0 
SUM2=0 . 0D0 
SUM3=0 . 0 DO 
DO 10 1=1, N 

SUMl=SUMl+DBLE (X ( I , 1) ) 

SUM2=SUM2+DBLE (X ( I , 2 ) ) 

SUM3=SUM3+DBLE (X ( I , 1) *X(I,2) ) 

10 CONTINUE 

XBARl=SUMl/FLOAT (N) 
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XBAR2-SUM2/FLOAT (N) 
EX1X2=SUM3/FL0AT (N) 

SUM1=0 . ODO 
SUM2=0 . ODO 
DO 20 1=1, N 

SUM1=SUM1+DBLE (X ( 1 , 1) **2) 
SUM2=SUM2 + DBLE ( X ( 1 , 2 ) **2) 

20 CONTINUE 

VARl= ( S UM1/FLOAT (N) ) - ( XBARl * * 2 ) 
VAR2= (SUM2/FLOAT (N) ) - (XBAR2**2) 
C0V=EX1X2- ( XBARl *XBAR2) 
DCORR=COV/( ( VARl *VAR2 ) **0.5) 
CORR=SNGL (DCORR) 



C 



RETURN 

END 
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VI. STUDY OF THE BEHAVIOR OF SERIAL CORRELATION 
ESTIMATES FOR DIFFERENT DISTRIBUTIONS 



A. CALCULATION OF THE FIRST SERIAL CORRELATION COEFFICIENT 

It is known that for an independent sample from a 

population with finite variance, the distribution of the 
serial correlation coefficient (Anderson and Walker, 1964) 
[Ref. 3] is asymptotically Normal with mean zero and vari- 
ances 1/n, where n is the sample size. If the population is 
i.i.d Normal then the bias is exactly -1/n. Since those 
asymptotic properties are frequently used as approximations 
in tests of significance, it is important to know how valid 
the approximation would be in small samples from a variety 
of distributions. We will look at that question in the next 
two sections and then go on to consider two alternative 
measures of correlation, Fisher's z-transform and the 2-fold 
jackknifed estimate of the correlation. Their ability to 
reduce bias and/or induce Normality will be examined against 
other changes in the distribution of the estimators, particu- 
larly variance inflation. A simulation study, without 
graphics, of some of these problems was conducted by Cox 
(1966) [Ref. 4] . 

B. SIMTB1 OUTPUT FOR SERIAL CORRELATION 

Figure 3(a) shows the simulated distribution and sample 
properties of the serial correlation coefficient estimate 
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nj (X -X)( X -X n ) 

3 =1 J : 

n _ p 

(n-l) l (X - X r 

j-1 J 

X = I X / n , 

U j = l 3 

n-l 

X, = y X ./ (n-l) , and 

j=i 3 

x n - l X /(n-l) 

n j = 2 11 

for various sub-sample sizes n = n^ . This definition matches 
that used by Anderson and Walker (1964). We consider first 
subsamples of size n^ = 10, and then of size ^ = 20, n^ = 30, 
n^ = 40, n^ = 50, ng = 75, n_, = 100 and ng = 150, successively. 
For each subsample size the input sample of N = 5000 simulated 
Normal (0,1) random variables is divided into as many full 
subsamples of size n^ as possible, and the serial correlation 
is computed for each of the [N/n^j subsamples of size n^. 

The entire procedure is then replicated M = 10 times , each 
time with a new simulated sample of N = 5000 Normal (0,1) 
variables . 

After all M replications have been run, all the estimates 
of serial correlation for each subsample size are grouped 
together and their simulated distribution is presented via a 



n 



where: 
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boxplot and summary statistics. The boxplot follows the 
standards discussed in Mosteller and Tukey (1977) [Ref. 5] 
with the median denoted by a + within the box, the mean by a 
* within the box, the outliers by 0's, and the far outliers 
by *'s beyond the whiskers. The summary statistics include 
the sample mean, sample standard deviation, estimated standard 
deviation of the sample mean (i.e., sample standard deviation/ 
sqrt (M jN/n^J ), sample skewness and sample kurtosis of the 
correlation estimates. 

Looking at the output, the first (leftmost) boxplot in 
the graph in Figure 3(a) shows the distribution of 



(# Replications) * 



-Simulation , 
Sample Size 
-Subsample, 

1 Size ' 



10 x 



5000 

10 



10 x 500 = 5000 



estimates of serial correlation from independent subsamples 
of size n^ = 10. Summary statistics for the boxplot can be 
found below the graph in the column labeled "Subsample Size 
10," so that the average serial correlation is -.1074, and 
the estimated standard deviation is .2996. The estimated 
standard deviation of the serial correlation estimate is 
. 299 6// ( 5000) = .00424. Recall that this refers to correla- 
tion estimates based on subsamples of size 10. 

Since the X-axis of the graph represents subsample size, 
the last (rightmost) boxplot shows the distribution of 
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10 X 



5000 

150 



10 x 33 = 330 



estimates of serial correlation from independent subsamples 
of size ng = 150. Although the 330 estimates are independent 
of each other, they are not independent of the 5000 estimates 
that comprise the first boxplot since the same data (divided 
and processed in different ways) was used for both. Summary 
statistics show that the average correlation has dropped to 
-.007372, indicating the fall off in bias, and the standard 
deviation has dropped to .07822, indicating the greater 
accuracy with which the correlation can be estimated when 150 
points, rather than .10 , are available. 

In order to quantify the changes that are occurring in 
the mean and variance of the distribution of the estimator 
as subsample size changes, SIMTBl performs two types of 
regressions. The first regression is on the averages and is 
done after each replication, using the average serial corre- 
lation for that replication, r , as the dependent variable. 

i 

Inverse powers of the subsample size serve as the indepen- 
dent variables. For Figure 3(a) the degree of the regression 
was chosen to be D = 3 so, for each replication, the equations 
we attempt to fit by least squares are: 



r 



n 



i 




for i = 1,2 



8 . 



This form anticipates the general asymptotic expansion 
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0 + 



E(0(n) ) 




which holds true in the current situation with 0=0 and 
(in the Normal case) a = -1 (see Cramer (1948) for general 
results of this type) [Ref. 6]. 

Values of a^ , a^, a ^ , and a^ are calculated after each 
replication, averaged across the M replications to get a^, 
a^, and a , and then the averages are reported below the 

summary statistics on the line "Mean of Regression on Averages- 
Coefficients." We find that a Q = -.000272 and a^ = -1.03074, 
both close to their theoretical counterparts. 

Because we have 10 replications and therefore 10 inde- 
pendent values of each of a^ , a^, a^ , and a^ , we can also 
estimate the variances and standard deviations of a^, a^ , a 2 , 
and a^ across replications. These values are presented on 
the two lines immediately below the coefficients. For 
instance, the estimated s.d. of the estimate a^ = -.000272 
of a Q is .003892. 

The regression line for the mean value of the estimator 
is presented visually in the graph as a dotted curve. The 
estimated asymptote (i.e., a^) is printed with a dashed line 
wherever it does not coincide with the regression line. Bias, 
therefore, can be viewed as the difference between those two 
lines . 

The second regression referred to above is done after all 
replications have been run and the variances of the estimators 
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at each subsample size have been calculated. (Note that the 
standard deviations, not the variances, are presented in the 
summary statistics.) It should be recalled from previous 
discussion that these variances, as well as all measures in 
the summary statistics, are based on the grouping together 
of the serial correlations from all replications , at each 
subsample size. This is in contrast to the procedure for the 
regression on the means, where average correlations are com- 
puted for each subsample size for each replication . In the 
case of the variances, we have 8 equations: 



Var(r n ) 
i 





i = 1,2 



8 , 



which we fit by least squares in order to estimate the coeffi- 
cient B q , B 2 / and 83 in the presumed asymptotic expansion 



Var (0 (n) ) 



B q 6 1 B 2 ^ B 3 

~n + 3/2 + ~J 5/2 

n ' n n 



This expansion holds for the variance of the estimated 
serial correlation coefficient for independent data. Usually 
it will be Bg in which we are most interested since Bq is 
used in computing asymptotic relative efficiencies of esti- 
mators. For independent data with finite variance, we know 
that Bq = 1. The computed values of bg, b^, , and b^, 

are presented on the line labeled 'Regression on Variance — 
Coefficients’. Notice that bg = .7438 is close to the 
theoretical value of 1 . 
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The final two numbers on Figure 3(a), YMIN and YMAX , 
simply show the scale of the vertical axis. Because the SIMTB1 
program option to put Figures 3(a), 3(b) and 3(c) on the 
same scale was in effect, it may be that no boxplot in a 
given Figure (e.g.. Figure 3(b)) requires the full range of 
Y- values . 

In order to produce Figure 3(b), the Normal (0,1) data 

that went into Figure 1(a) was squared to create longer 
2 

tailed x (1) random variables. The output is entirely analo- 
gous to that for Figure 3(a). Similarly, for Figure 1(c), 
the Normal (0,1) data was exponentiated in order to create 
Lognormal (0,1) data and to produce analogous graphical 
output. The indication is that the distribution of the sample 
serial correlation is robust with respect to the population 
distribution. 

The features of the SIMTBl output will become clearer when 
they are associated with the various properties of the corre- 
lation estimator. First, however, a few technical comments 
concerning the regressions are necessary. 

C. SOME COMMENTS ON THE REGRESSIONS 

Two types of problems, numerical and statistical, can 
occur when attempting to fit the two sets of regression 
equations presented in Section VI. B. 

First, there is the question of numerical stability when 
the independent variables, {1, n^ ' n i ' n i } or 
{n^, decrease geometrically. If we 
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T 

attempt to form X X, where X is the respective design matrix 
T 

and X is the transpose of X, we get values that range from 

8 -6 

8 (assuming 8 subsample sizes) to £ n. for the regression 

8 -2 8 i =1 1 

on the means, and J n. to £ n~^ f or the regression on 

i=l 1 i=l 1 

the variances. Experience has shown that attempts to solve 

T 

systems with such extremes in the X X matrix produce errone- 
ous results. Instead, SIMTB1 scales the design matrices by 
multiplying each column of X by Max(n^) raised to the proper 
power so that no entry becomes too small. The standard 
Choleski decomposition is then used to fit the equations, 
and the coefficients are properly rescaled before they are 
reported. This procedure produces numerically reliable 
results . 

The second problem concerns the breakdown of statistical 
assumptions in our regression models. It has already been 
pointed out in Section VI .B that the two sets of dependent 
variables : 

(1) the 0(n^) when considering the regression on the means 



(2) the s' 



(n.) = 



M 

j = l 



(0j (n i ) - 0(n ± ) ) 
MJN/nJ 



when considering the regression on the variances, 

are not independent over i since all are based on the same 

simulated data. The extent of the dependence is demonstrated 

by the correlation matrix in Table 1. Entries in that table 

2 2 

show the estimated correlation between s (n^) and s (n^) for 
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TABLE 1 



ENTRIES IN THE TABLE ARE THE ESTIMATED CORRELATIONS 
BETWEEN THE ESTIMATED VARIANCES OF THE r AT 
DIFFERENT SUBSAMPLE SIZES: n i 

CORR(s 2 (r n ),s 2 (r n )) for i = j = 1,...,8 

i n j 



i 


1 


2 


3 


4 


5 


6 


7 


8 


j 


















1 


1.00 


.49 


.46 


-.26 


.18 - 


.17 


.14 


.01 


2 


.49 


1.00 


.40 


.55 


.11 


.38 


.38 


-.03 


3 


.46 


.40 


1.00 


.23 


.23 


.44 


.21 


.29 


4 


-.26 


.55 


.23 


1.00 


.42 


.86 


.57 


.35 


5 


.18 


.11 


.23 


.42 


1.00 


.71 


.43 


.59 


6 


-.17 


.38 


.44 


.86 


.71 1 


o 

o 

• 


.45 


.53 


7 


.14 


.38 


.21 


.57 


.43 


.45 


1.00 


.72 


8 


.01 


-.03 


.29 


.35 


.59 


.53 


.72 


1.00 


Recall 


that 


r is 
n 


the estimated 


serial 


correlation for 



simulated Normal (0,1) subsample of size n. Also, the 

estimated correlations shown above were computed using 10 

2 2 

values (replications) of s (r ) and s (r ) for each i and 
r n . n . 
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all i and j , where the estimation was done by repeating the 
SIMTBl experiment with 10 different batches of 50,000 simu- 
lated random variables. Since only 10 values went into each 
correlation calculation, the table is only accurate to 
within approximately ± 2//1Q = .632. We see some indication 
of positive correlation, especially when i and j are close, 
but the lack of independence is not severe enough to hurt 
the regression results for either the estimated means or 
variances significantly. 

A second assumption, implicit in any regression, is that 
the dependent variables have equal variances . This condition 
holds true for the means, which can be shown to satisfy 



Var(0(n i )) = | 



independently of i. The estimated variances, however, are not 

A 

equivalent and, if we assume the 0 ^ (n^) to be approximately 
Normally distributed so that 



M 



|N/n.| 

L I lJ (G.(n.) 
j = l 3 



0(n.) ) 



is approximately proportional to a | N / n i _ random variable, 

l 4 

we can compute 



Var(s 2 (n.)) = — j 

1 MNn . -n 

l l 
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2 

To correct this problem, SIMTB1 scales the s (n^) by >/n^ 
so that 



Var ( /n~ 



s 2 (n i ) ) 



2 

MN - n . 
1 



_ 2 _ 

MN 



since MN >> n^. The design matrix is scaled accordingly and 
the values b^ , b^, b^, and b^ discussed in Section VI. B. are 
reported. 

Table 2 shows the effects of the rescaling by presenting 

2 

first the estimated variances of the s (n^) , where the esti- 
mation is done by repeating SIMTBl for 10 batches of 50,000 
simulated data points . These estimated variances decrease 
as n^ increases, closely paralleling the second line of Table 

2 which has the approximate theoretical values (i.e., 

2 

2/ (MNn^ - n^) ) . The final line of Table 2 shows the estimated 

2 2 

variances of the rescaled s (n^) , i.e., the /nT s (n^) , 
which, as expected and hoped, show a more constant variance 
with i. 

Although future versions of SIMTBl will include more 
sophisticated regression routines and the ability to gener- 
ate independent samples at each subsample size, the SIMTBl 
is quick, usable, and accurate for most situations. 

D. INTERPRETING THE SERIAL CORRELATION RESULTS 

Returning to Figure 3(a) which shows the simulated dis- 
tribution of the serial correlation coefficient from indepen- 
dent, Normal (0,1) data, the following comments summarize the 
most striking features: 



